Peter Freeman (2019 SLSW)
26 June 2019
The book R for Data Science claims that exploratory data analysis, or EDA, is a “state of mind.” More usefully, it states that “[y]our goal during EDA is to develop an understanding of your data.”
The EDA “cycle” looks something like the following:
The basic questions that one can ask are the following:
The basic questions above motivate an important point.
If the number of predictor variables is larger than 3, one cannot visualize the native space; one can only visualize projections of that space. If those projections yield useful information, great! But if they do not, one should not give up, because information may have been lost in projection.
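To make the point about projections concrete, here is a minimal sketch using simulated data. Everything in it is an assumption made for illustration: four hypothetical predictors, a response that depends only on the sign of the product of the first two, and a hand-written `pearson` helper. Each one-dimensional view of the response against a single predictor shows essentially no linear association, yet the response is fully determined in the joint space.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 10_000

# Four hypothetical predictors; only the first two matter, and only jointly.
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n)]
y = [1.0 if row[0] * row[1] > 0 else -1.0 for row in X]

# One-dimensional projections: the response against each predictor alone.
marginal = [pearson([row[j] for row in X], y) for j in range(4)]

# The same data viewed jointly: the product of the first two predictors
# is strongly associated with the response.
joint = pearson([row[0] * row[1] for row in X], y)

print("marginal correlations:", [round(m, 3) for m in marginal])
print("joint (product) correlation:", round(joint, 3))
```

The marginal correlations hover near zero while the product term does not, which is the sense in which information can be lost in projection.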
In other words: EDA itself is not a replacement for statistical learning. It is a means by which to build intuition prior to “turning the learning crank.”
Some of the features of histograms to keep in mind:
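One feature worth illustrating is that a histogram's appearance depends on the chosen number of bins, even though a density-normalized histogram always integrates to one. The sketch below assumes simulated bimodal data and a hypothetical helper, `histogram_density`, written here for illustration with equal-width bins.

```python
import random

def histogram_density(data, n_bins):
    """Return (densities, bin_width) for an equal-width histogram.

    Densities are scaled so that sum(densities) * bin_width equals 1.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in data:
        # The maximum value would index one past the end; clamp it to the last bin.
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    n = len(data)
    return [c / (n * width) for c in counts], width

rng = random.Random(1)  # fixed seed, chosen arbitrarily
# Simulated data: an equal mixture of two normal distributions.
data = [rng.gauss(-2, 1) for _ in range(500)] + [rng.gauss(2, 1) for _ in range(500)]

coarse, coarse_width = histogram_density(data, 3)
fine, fine_width = histogram_density(data, 30)

print("3 bins: ", [round(d, 3) for d in coarse])
print("30 bins:", [round(d, 3) for d in fine])
```

With very few bins the two modes may blur together; with many bins they become visible, at the cost of noisier bar heights.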
Some of the features of boxplots to keep in mind:
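As a reminder of what a boxplot actually draws, here is a minimal sketch of the quantities behind it. It assumes the common Tukey convention (whiskers extend to the most extreme points within 1.5 times the interquartile range of the box) and linear interpolation for quantiles; plotting packages may use slightly different rules. The data and the `quantile` helper are made up for illustration.

```python
def quantile(sorted_data, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_data) - 1)
    return sorted_data[lo] + (pos - lo) * (sorted_data[hi] - sorted_data[lo])

data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # toy data with one extreme value

q1 = quantile(data, 0.25)
median = quantile(data, 0.50)
q3 = quantile(data, 0.75)
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR convention.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences;
# anything outside is drawn as an individual point.
inside = [v for v in data if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Note how the single extreme value barely moves the box, which is why boxplots are a robust summary of center and spread.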
A feature of scatter plots to keep in mind: they show the locations of samples from a bivariate distribution, but unlike histograms they do not estimate that distribution itself. (For that you might go beyond typical workaday EDA and utilize kernel density estimation, which we’ll cover elsewhere.) Scatter plots do, however, indicate the level of covariance between two variables.
Covariance:
Correlation:
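For reference, the standard population definitions are as follows. The symbols $X$, $Y$, $\mu$, and $\sigma$ are notation assumed here, with $\mu$ denoting a mean and $\sigma$ a standard deviation.

```latex
\mathrm{Cov}(X,Y) = E\left[(X-\mu_X)(Y-\mu_Y)\right] = E[XY] - \mu_X \mu_Y

\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}, \qquad -1 \le \rho_{X,Y} \le 1
```

Correlation is thus covariance rescaled to be unitless, which makes it comparable across pairs of variables measured on different scales.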
Note that “uncorrelated” and “independent” are not synonymous. Uncorrelated means that there is no linear association between the variables; independent means that there is no association between the variables, period.
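A small sketch of the distinction, using made-up data: below, `y_quad` is completely determined by `x`, so the two are as dependent as can be, yet their sample correlation is zero because the relationship has no linear component. The helpers `covariance` and `correlation` are written here for illustration.

```python
def covariance(xs, ys):
    """Sample covariance (n - 1 denominator)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Sample Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

# A grid of x values that is symmetric about zero.
x = [i / 100 for i in range(-100, 101)]

y_line = [2 * v + 1 for v in x]  # exact linear relationship
y_quad = [v ** 2 for v in x]     # exact, but purely nonlinear, relationship

r_line = correlation(x, y_line)
r_quad = correlation(x, y_quad)

print("linear:   ", round(r_line, 6))
print("quadratic:", round(r_quad, 6))
```

A scatter plot of `x` against `y_quad` would show the parabola immediately, which is one reason to look at the plot rather than rely on the correlation coefficient alone.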
Some tips of the trade to keep in mind when constructing scatter plots:
But remember: even if you see no apparent associations, they may still exist in the data’s native space! EDA is not a replacement for statistical learning. Say it again. Say it many times. Good.